Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save and parse Arkane species with all statmech properties using YAML format #1402

Merged
merged 16 commits into from
Jan 22, 2019

Conversation

alongd
Copy link
Member

@alongd alongd commented Jun 13, 2018

We'd like to be able to conveniently use previously calculated statmech data in Arkane (prev. Cantherm). The purpose is to have an online repository of such files, from which users can pick the desired species at the appropriate level of theory (if available). This PR sets the ground by allowing Arkane to save and parse such files.

@codecov
Copy link

codecov bot commented Jun 13, 2018

Codecov Report

Merging #1402 into master will decrease coverage by 0.04%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1402      +/-   ##
==========================================
- Coverage   42.07%   42.02%   -0.05%     
==========================================
  Files         165      165              
  Lines       27824    27867      +43     
  Branches     5668     5680      +12     
==========================================
+ Hits        11706    11711       +5     
- Misses      15332    15371      +39     
+ Partials      786      785       -1
Impacted Files Coverage Δ
rmgpy/quantity.py 0% <0%> (ø) ⬆️
rmgpy/data/kinetics/family.py 58.58% <0%> (+0.29%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c61d2ef...4c77d09. Read the comment docs.

@alongd alongd force-pushed the read_spc branch 3 times, most recently from 39614f3 to 7ac6569 Compare June 13, 2018 18:59
@mliu49
Copy link
Contributor

mliu49 commented Jun 13, 2018

If this is going to be a brand new file type, I strongly suggest using an established format such as yaml, which you can parse using pyyaml to give a python dict. Using exec on user provided input files is a security risk since we could be executing arbitrary code. It would be best to convert all of our input files and database files to not require exec, but it'll take time.

Regarding the idea of having a database of statmech info, it would again be best if it's not a plain-text database like RMG-database is currently. Kehang's thermo central database uses mongoDB, which is an option, especially since it seems like it would overlap or be a part of this future statmech database. There are also tons of other options, but we should definitely put some thought into the database design.

@alongd
Copy link
Member Author

alongd commented Jun 15, 2018

I changed the format into .yml.
Still, we need to use eval() (instead of execute()), because loading the YML file only gives dictionaries of strings, and the objects must be reconstructed.
I suggest we add more thorough checks to the constructor functions I added to common.py that the string we're evaluating is indeed what we think it is (contains several key words, under a certain maximum length, etc.). Other suggestions are welcomed!

Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like where this is going! I made a few comments, feel free to push back on any if you have other thoughts.

from rmgpy.molecule.translator import toInChI, toInChIKey
from rmgpy.thermo import NASA, NASAPolynomial, Wilhoit, ThermoData
from rmgpy.statmech.conformer import Conformer
from rmgpy.statmech import IdealGasTranslation, LinearRotor, NonlinearRotor, HarmonicOscillator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of these imports are unused, so you could probably remove them.

Also, imports should be grouped by standard imports, third-party libraries, then local imports (PEP 8). It's not that important, but something to consider. For example:

import logging
import os.path
import re
import time

import numpy
import yaml
from yaml import ...

import rmgpy.constants as constants
from rmgpy import ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for informing me of the structure, I'll adhere to it.
The imports here are being used implicitly by the eval() function which creates objects from these classes if they are present in the .yml file, so I think we should leave them (otherwise it errors). Since advanced text editors mark them as unused, I think it's a good idea to add a comment next to them explaining why they are kept. I'll happily reorganize imports across Cantherm.

' kJ/mol. This can cause significant errors in your computed rate constants. ')


def determine_molecular_weight(species):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating this method, you should modify the getter for Species.molecularWeight.

First, it would be more convenient for me if you could update the syntax to match my PR (this commit). Then, you can change the get method:

@property
def molecularWeight(self):
    if self._molecularWeight is None:
        self._molecularWeight = quantity.Mass(self.molecule[0].getMolecularWeight(), 'kg/mol')
    return self._molecularWeight

Once you do that, you can access the Species.molecularWeight property regardless of whether or not it's been set.

Copy link
Member Author

@alongd alongd Jun 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment!
I'll implement it in the Species __init__():

def __init__(self, ..., molecularWeight=None, ...):
        if self.molecularWeight is None and self.molecule is not None and len(self.molecule) > 0:
            self._molecularWeight = quantity.Mass(self.molecule[0].getMolecularWeight(), 'kg/mol')
        else:
            self.molecularWeight = molecularWeight

Shall I still change the syntax of molecularWeight? I can leave it untouched so #1329 will have less conflicts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second though, your recommendation seems better if for some reason an empty Species object is first created, then a structure is assigned.

d['thermo'] = species.thermo
d['chemkin_thermo_string'] = chemkin_thermo_string

self.species_dictionary = d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be called species_dictionary. I also think that you could save all of these parameters as normal attributes, then create a method as_dictionary if you need them as a dictionary.

If you do that, I think you also wouldn't need the update method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.
I think I still need something like the update() method, though, to only append the species attributes to this object (the object is initiated with non-species-specific attributes such as author and level_of_theory which are kept)

SMILES: '[O]OS'
adjacencyList: 'multiplicity 2

1 S u0 p2 c0 {2,S} {4,S}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why there are extra blank lines in the adjlist?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why, it bugs me as well. But I verified that the adjList is being read correctly.
If I simply print it during the run (either using print or logging.info) it looks correct with no extra line brakes.

transport_data: TransportData(epsilon=(3625.11,'J/mol'), sigma=(5.7,'angstrom'))
use_atom_corrections: true
use_bond_corrections: false
use_hindered_rotors: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these parameters don't seem like they would be important to store, e.g. use_atom_corrections, use_bond_corrections, use_hindered_rotors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least use_bond_corrections is crucial, the rest are just "nice to have".

use_bond_corrections: This might impact the decision of the user of whether or not to use this file. If the user wants to use this species in a PES with other species calculated at different levels of theory, it is important/recommended to use bond corrections. On the other hand, if all species were calculated at the same level, it is better practice not to use bond corrections (which don't capture all bond types, and are often ambiguous for TSs).

use_hindered_rotors: This could also be directly inferred from the Conformer object. However, for a quick assessment, it's nice to have it explicitly written.

use_atom_corrections: Can be inferred from the atom_energies attribute, I'm OK with removing it since using this feature is less common.

spec.conformer = Conformer(E0=E0, modes=modes, spinMultiplicity=spinMultiplicity, opticalIsomers=opticalIsomers)
if molecularWeight is None:
if is_pdep():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work if the species block comes before the pdep block?

Also, if you refactor the molecular weight implementation based on my other comment, would that eliminate the need for this if block?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified

"""
Load the species with all statMech data from the .yml file
"""
global jobList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like you use jobList in this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

coordinates = np.ndarray(shape=(len(number), 3))
for i, coorlist in enumerate(job.species.conformer.coordinates.value_si):
coordinates[i] = [coor for coor in coorlist]
mol = Molecule().fromXYZ(atomicNums=number, coordinates=coordinates)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section seems oddly coded. Was it done intentionally? It seems like the following would be sufficient:

number = job.species.conformer.number.value_si.tolist()
coordinates = job.species.conformer.coordinates.value_si.tolist()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the issue here is that we must pass a np.ndarray type to Molecule().fromXYZ()
The data generated from .tolist() throws an error.
I'll be happy to consider a different alternative to the non straight forward code I used, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I misinterpreted it, I thought initially that the conformer attributes were np.ndarray objects, and you were converting them into lists.

If the conformer attributes are np.ndarray, it seems you could use them directly. If they're lists, then couldn't you use np.array(list_input) to convert it to an np.ndarray?

filename, file_extension = os.path.splitext(self.path)
if 'yml' in file_extension or 'yaml' in file_extension:
self.load_species_from_database = True
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you could just call load_species_from_database_file here instead of main.py, and it would also eliminate the need for the load_species_from_database attribute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, will change accordingly

@@ -188,6 +188,30 @@ def save(self, outputFile):
f.write('\n')
f.close()

return chem_string

def save_species_to_database_file(self, path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be more appropriate as a method of CanthermSpecies. Let me know what you think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

@alongd
Copy link
Member Author

alongd commented Jun 22, 2018

@mliu49, Thanks for all the valuable comments!
I addressed them in fixup commits. At some point I didn't bother correctly naming the fixup commits due to the relocation of methods between files, I think I'll re-commit it all instead of squashing.
Let me know if you have any additional comments.

@alongd
Copy link
Member Author

alongd commented Jun 26, 2018

@mliu49, can I squash this branch?

@mliu49
Copy link
Contributor

mliu49 commented Jun 26, 2018

Yeah, you can go ahead and squash. I guess this might get replaced by the new method though, if that turns out ok.

rmgpy/rmg_obj.py Outdated
###############################################################################
import numpy as np

class RMG_Obj(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments on naming:

  • Class names should be CamelCase, though I think we can make an exception for RMG. I would probably call this RMGObject.
  • Module names are ideally one word. Perhaps something like object, base, or common?
  • Variable names should be lowercase. I think D should be renamed to something else, like output or even output_dict.

Also, could you end the file with a newline?

@mliu49
Copy link
Contributor

mliu49 commented Jun 27, 2018

I don't think we discussed how to handle Quantity objects with the new dict approach - we would need some way to save and load the units.

@mjohnson541
Copy link
Contributor

They're our own objects right? So can't we just make them inherit from RMGObject?

def __init__(self):
pass
def as_dict(self):
D = self.__dict__.copy()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this addition, @mjohnson541 !
I think that self.__dict__ contains methods as well, and we should first check if not callable(value) before adding entries to the output dictionary.

@alongd alongd force-pushed the read_spc branch 2 times, most recently from 322feb1 to 4bfbbb8 Compare July 5, 2018 18:01
@alongd
Copy link
Member Author

alongd commented Jul 5, 2018

I rewrote this branch using the as_dict() and make_object() methods we discussed for YAML loading and parsing.
Thanks @mjohnson541 and @mliu49 for the discussion and help with writing the methods.
The branch is now ready for another review cycle.

I'll note that since cythonized classes cannot inherit from non-cythonized classes, and vice versa, I had to make two parent classes: RMGObject which is cythonized, and RMGPYObject which isn't.
Quantity (Scalar & Array) and the thermo classes (NASA & Wilhoit) were not set to inherit from the parent objects. I think it's OK for quantity, but we could theoretically modify the way attributes are saved in the thermo classes so that they too can inherit from RMGObject. But I didn't think it was that important when I implemented it.

@alongd alongd force-pushed the read_spc branch 2 times, most recently from b132315 to ad4e8c9 Compare July 6, 2018 14:33
@mliu49
Copy link
Contributor

mliu49 commented Jul 6, 2018

Are you sure that Python classes can't subclass Cython classes? It should be possible according to the Cython documentation (section 3.5).

@alongd
Copy link
Member Author

alongd commented Jan 8, 2019

rebased. @mliu49, let me know when I can squash.

@alongd
Copy link
Member Author

alongd commented Jan 21, 2019

@mliu49, let me know when to squash the fixups to finalize the PR

Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few more comments. You can go ahead and squash everything else.

arkane/input.py Outdated
" cannot be reconstructed.".format(spec.label))
if spec.molecularWeight is None and is_pdep(jobList):
# If one of the jobs is pdep and no molecular weight is given or calculated, raise an error
raise ValueError("No molecularWeight was entered for species {0}. Since a structure wasn't given"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, my suggestion might not have been correct, since this now misses the case where a structure is not provided. One possibility:

if molecularWeight is not None:
    spec.molecularWeight = molecularWeight
elif spec.molecularWeight is None and is_pdep(jobList):
    raise ValueError()

There shouldn't be any need to check whether structure exists, because the the Species.molecularWeight property checks if it has a molecule attribute, which will be empty if structure was empty.

@@ -1,10 +1,11 @@
# cython: embedsignature=True, cdivision=True
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually isn't necessary either, since this is technically not a python file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean the .pyx or the .pxd file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both

arkane/qchem.py Outdated
@@ -273,12 +273,13 @@ def loadEnergy(self, frequencyScaleFactor=1.):
"""
E0 = None
with open(self.path, 'r') as f:
for line in f:
lines = f.readlines()
for line in lines:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original implementation is preferable because it treats the file as an iterator, so only one line is loaded at a time. If you use readlines(), then the entire file has to be loaded into memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think this isn't related to this PR (sorry) so I removed it, we could discuss it at another time (I think it might be necessary for running Arkane through ARC). Thanks for bringing this up, I removed the commit.

@alongd
Copy link
Member Author

alongd commented Jan 22, 2019

OK, done and squashed

Copy link
Contributor

@mliu49 mliu49 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@alongd alongd merged commit 3a7b901 into master Jan 22, 2019
@alongd alongd deleted the read_spc branch January 22, 2019 15:42
@mliu49 mliu49 mentioned this pull request May 15, 2019
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants